fix(gemma): keep tied Q8_0 lm_head packed in eager NATIVE_OPTIMIZED path (#178) #179

Merged
michalharakal merged 1 commit into develop from fix/gemma-board-embed-nocopy on Jun 15, 2026
@michalharakal

Contributor

Closes part of #178 (the lm_head half).

Problem

FunctionGemma's token_embd is Q8_0 and tied (there is no separate output.weight), so convertGemmaWeightsPacked dequantized both token_embd and output to FP32 (2 × ~0.67 GB), which runs out of memory on the 1.9 GB SL2610. However, output/lm_head is a real matmul weight, not an embedding, so it does not need the FP32 copy.

Fix

  • packGemmaKQuant: add Q8_0 (32-elem/34 B blocks → Q8_0BlockTensorData); generalize the row-major→block-major relayout with a blockSize param.
  • convertGemmaWeightsPacked: drop OUTPUT_WEIGHT from the isEmbed FP32 branch so it packs like the other matmul weights and runs on the (NEON) Q8_0 kernel. token_embd stays FP32 (it's gathered) but is now wrapped no-copy via DenseFloatArrayTensorData instead of ctx.fromFloatArray (which allocates a second ~0.67 GB buffer).
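
For illustration only, here is a minimal sketch of what a block-size-parameterized relayout could look like. It assumes that "row-major" means blocks stored row by row and "block-major" means all rows of block column 0 first, then block column 1, and so on. That reading, the function name `relayoutBlocks`, and the `blockBytes` parameter are assumptions, not the actual packGemmaKQuant code.

```kotlin
// Hypothetical sketch: transpose a grid of fixed-size quant blocks.
// Source order: [row][blockIndex]; destination order: [blockIndex][row].
// blockBytes would be 34 for Q8_0 (2-byte FP16 scale + 32 int8 quants).
fun relayoutBlocks(
    src: ByteArray,
    rows: Int,
    blocksPerRow: Int,
    blockBytes: Int,
): ByteArray {
    require(src.size == rows * blocksPerRow * blockBytes) { "unexpected buffer size" }
    val dst = ByteArray(src.size)
    for (r in 0 until rows) {
        for (b in 0 until blocksPerRow) {
            val from = (r * blocksPerRow + b) * blockBytes
            val to = (b * rows + r) * blockBytes
            // Copy one whole block; the bytes inside a block are never reordered.
            src.copyInto(dst, destinationOffset = to, startIndex = from, endIndex = from + blockBytes)
        }
    }
    return dst
}
```

Because only the byte size of a block enters the index arithmetic, the same routine would serve other block formats by passing a different `blockBytes`.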

Tied embed/lm_head footprint: ~1.34 GB → ~0.76 GB.
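
Part of this saving comes from wrapping the already-dequantized token_embd array rather than copying it. The difference can be sketched as follows; `WrappedFloats`, `wrapNoCopy`, and `wrapWithCopy` are hypothetical stand-ins, not the SKaiNET DenseFloatArrayTensorData or ctx.fromFloatArray implementations.

```kotlin
// Hypothetical stand-in for a tensor-data holder backed by a FloatArray.
class WrappedFloats(val data: FloatArray)

// No-copy wrap: the holder shares the caller's array, so no extra memory is allocated.
fun wrapNoCopy(weights: FloatArray): WrappedFloats = WrappedFloats(weights)

// Copying wrap: allocates a second array of the same size, doubling the footprint
// for as long as both arrays stay reachable.
fun wrapWithCopy(weights: FloatArray): WrappedFloats = WrappedFloats(weights.copyOf())
```

The trade-off of the no-copy form is aliasing: later writes to the original array are visible through the wrapper, which is acceptable for weights that stay read-only after loading.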

Depends on

SKaiNET#736 (fix/q8_0-lazy-transpose) adds the Q8_0 case to the engine's ops.transpose, so linearProject can transpose the packed weight without the Byte→Float ClassCastException. Merge and publish that first.

Verification

GemmaQ5KPackedParityTest via composite -PuseLocalSkainet=true (both repos from source): eager load(NATIVE_OPTIMIZED) decodes byte-identically to the FP32 baseline; lm_head packed Q8_0; no crash.

Remaining (#178, separate)

Row-dequant the main token_embd gather (today only the per-layer PLE embed implements RowDequantSource) to drop the last ~0.67 GB for full board fit.
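
As a rough sketch of the idea, a row-dequantizing gather keeps the table packed and expands only the requested row to FP32. The code below assumes the standard Q8_0 block layout (a little-endian FP16 scale followed by 32 signed 8-bit quants, 34 bytes total) with rows stored contiguously. The names `dequantQ8Row` and `halfToFloat` are hypothetical and do not reflect the RowDequantSource interface.

```kotlin
const val Q8_0_BLOCK_ELEMS = 32
const val Q8_0_BLOCK_BYTES = 34 // 2-byte FP16 scale + 32 int8 quants

// Convert IEEE 754 half-precision bits to a Float.
fun halfToFloat(bits: Int): Float {
    val sign = (bits ushr 15) and 0x1
    val exp = (bits ushr 10) and 0x1F
    val mant = bits and 0x3FF
    val magnitude = when (exp) {
        0 -> mant * 5.9604645e-8f // subnormal: mant * 2^-24
        0x1F -> if (mant == 0) Float.POSITIVE_INFINITY else Float.NaN
        else -> Float.fromBits(((exp + 112) shl 23) or (mant shl 13)) // rebias 15 -> 127
    }
    return if (sign == 1) -magnitude else magnitude
}

// Dequantize a single row of a packed Q8_0 table into `out` (length >= cols).
fun dequantQ8Row(
    packed: ByteArray,
    row: Int,
    cols: Int,
    out: FloatArray = FloatArray(cols),
): FloatArray {
    require(cols % Q8_0_BLOCK_ELEMS == 0) { "cols must be a multiple of 32" }
    val blocksPerRow = cols / Q8_0_BLOCK_ELEMS
    var src = row * blocksPerRow * Q8_0_BLOCK_BYTES
    var dst = 0
    repeat(blocksPerRow) {
        val lo = packed[src].toInt() and 0xFF
        val hi = packed[src + 1].toInt() and 0xFF
        val scale = halfToFloat(lo or (hi shl 8))
        for (i in 0 until Q8_0_BLOCK_ELEMS) {
            out[dst++] = packed[src + 2 + i] * scale // signed int8 times scale
        }
        src += Q8_0_BLOCK_BYTES
    }
    return out
}
```

With this approach a token lookup touches one row of packed bytes and writes one row of floats, so the full FP32 table never has to be materialized.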

🤖 Generated with Claude Code

…ath (#178)

FunctionGemma's token_embd is Q8_0 and tied, so convertGemmaWeightsPacked was
dequanting BOTH token_embd AND output to FP32 (2×~0.67 GB) — OOM on the 1.9 GB
SL2610. `output`/lm_head is a real matmul weight, not an embedding:

- packGemmaKQuant: add Q8_0 (32-elem/34B blocks → Q8_0BlockTensorData);
  generalize the row-major→block-major relayout with a blockSize param.
- convertGemmaWeightsPacked: drop OUTPUT_WEIGHT from the isEmbed FP32 branch so
  it packs like the other matmul weights and runs on the (NEON) Q8_0 kernel.
  token_embd stays FP32 (it's gathered) but is now wrapped no-copy via
  DenseFloatArrayTensorData instead of ctx.fromFloatArray (which allocates a
  second ~0.67 GB buffer).

Footprint for the tied embed/lm_head drops ~1.34 GB → ~0.67 GB (embed FP32) +
~0.09 GB (packed Q8_0 lm_head). Requires the engine Q8_0 case in ops.transpose
(SKaiNET fix/q8_0-lazy-transpose) so linearProject can transpose the packed
weight.

Verified: GemmaQ5KPackedParityTest (composite -PuseLocalSkainet) — eager
load(NATIVE_OPTIMIZED) decodes byte-identically to the FP32 baseline; lm_head
packed as Q8_0. (token_embd row-dequant gather to drop the last ~0.67 GB is the
remaining follow-up in #178.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 689a283 into develop Jun 15, 2026
0 of 2 checks passed
@michalharakal michalharakal deleted the fix/gemma-board-embed-nocopy branch June 15, 2026 11:45
michalharakal added a commit that referenced this pull request Jun 15, 2026
…e.properties; fix stale Q8_0 test

Real fix for the Build & Test failure (was masked by, then surfaced after, the
JS NPM config-time issue):

1. `gradle.properties`: set `android.dependencyResolutionAtConfigurationTime.disallow=false`.
   AGP's DependencyResolutionChecks fails the build when KGP's KotlinPackageJsonTask
   resolves the Kotlin/JS + Wasm `*NpmAggregated` configs at configuration time
   (we have JS npm deps: ktor-client-js, kotlinx-browser) — `assemble`/`allTests`
   threw `Configuration 'jsNpmAggregated' was resolved during configuration time`
   (gradle#31483), a false positive against KGP's known behaviour. AGP reads this
   option ONLY from the project gradle.properties — NOT from `-P` or the CI's
   ~/.gradle/gradle.properties (which is why the earlier attempts didn't take).
   Reverted those no-op attempts (build.yml/publish.yml `-P`, ci-gradle.properties).

2. `GemmaQuantLayoutTest`: `pack_non_kquant_returns_null` asserted Q8_0 packs to
   null, but #179 added Q8_0 packing — it now returns Q8_0BlockTensorData. Replace
   with `pack_q8_0_produces_block_tensor` + a true-null case (Q4_1).

Verified locally: `clean assemble allTests --no-configuration-cache` is GREEN.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>